1  Introduction


Techniques and networks for learning from examples may be considered as
methods for approximating multivariate functions from sparse data (the
"examples"). In recent papers we have developed one such technique that we
have called regularization networks, of which Radial Basis Functions are a
special case and Hyper Basis Functions are the most general form (for a
review see Poggio and Girosi, 1990 and references therein). The underlying
theory is quite well developed and understood: for instance, the role of the
hidden units in the corresponding network (see Fig. (1)) is easily explained.
The method has also been demonstrated to work well in a number of practical
applications. Another technique, which is extremely popular, is associated
with multilayer perceptrons (MLPs), typically used in conjunction with a
version of gradient descent (for estimating the parameters) called
backpropagation. In this paper, we will consider multilayer perceptrons with
one hidden layer and linear output units. These networks have been used
successfully in many cases of learning from examples. The underlying theory
is less developed, though a few theoretical results have been obtained in
recent months. In particular, it has been proved that MLPs have what we call
the Weierstrass property, that is, they can approximate arbitrarily well,
provided enough units are available, any continuous function on a compact
interval (Hornik, Stinchcombe and White, 1989; Stinchcombe and White, 1989;
Carroll and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989). Regularization
networks also have this property, which is shared by many other approximation
schemes (Girosi and Poggio, 1989).

It  is  natural  to  explore  the  relation  between  the  two  techniques,  especially 
because  the  corresponding  networks  have  superficially  a  similar  structure, 
both  having  one  hidden  layer  of  units  as  shown  by  Fig.  (1). 

The network of Fig. (1) represents the class of functions of the type

$$f(\mathbf{x}) = \sum_{i=1}^{n} c_i H_i(\mathbf{x}) \qquad (1)$$

where the $H_i$ are functions that depend on some parameters to be estimated
together with the coefficients $c_i$. When the functions $H_i$ are kept fixed,
the function $f(\mathbf{x})$ is linear in its parameters (the $c_i$), and the
resulting approximation scheme is linear.

Figure 1: The most general network with one layer of hidden units. Here
we show the two-dimensional case, in which $\mathbf{x} = (x, y)$. Each function
$H_i$ can depend on a set of unknown parameters, which are computed during the
learning phase, as are the coefficients $c_i$. When
$H_i = \sigma(\mathbf{w}_i \cdot \mathbf{x} + \theta_i)$ the network is a
multilayer perceptron; when $H_i = h(\|\mathbf{x} - \mathbf{t}_i\|)$ the network
is a regularization network for appropriate choices of $h$.

Depending on the function $H_i$, different approximation schemes can be
obtained. Two common choices are the following:

1. Regularization networks, which correspond to the choice:

$$H_i(\mathbf{x}) = h(\|\mathbf{x} - \mathbf{t}_i\|_W)$$

where $h$ is a (conditionally) positive definite function, the $\mathbf{t}_i$
are d-dimensional vectors called "centers", $W$ is a $d \times d$ matrix, and
we have defined the weighted norm $\|\cdot\|_W$ as

$$\|\mathbf{x}\|_W^2 = \mathbf{x}^T W^T W \mathbf{x} .$$

We call this scheme HyperBF. RBF is the case in which the centers
coincide with the data points, and GRBF (Generalized Radial Basis
Functions) is the case in which the centers $\mathbf{t}_i$ are free parameters
to be estimated but the weight matrix $W$ is fixed and equal to the identity
matrix.

This class of approximation schemes can be formally derived, in the
framework of regularization theory, from a variational principle which
has a Bayesian interpretation. It includes:

•  kernel estimation techniques and Parzen windows

•  splines

•  HyperBF and RBF

2. Ridge function approximation:

$$H_i(\mathbf{x}) = h_i(\mathbf{x} \cdot \mathbf{w}_i + \theta_i)$$

where the $\mathbf{w}_i$ are d-dimensional vectors called "weights", and the
parameters $\theta_i$ are bias terms. Until now this form of approximation
had no variational or Bayesian interpretation. Very recent results by Poggio
and Girosi (unpublished) show that a set of ridge function approximation
schemes can be derived as limits of regularization networks for an
appropriate class of stabilizers. Several techniques are included in this
class:

•  Projection Pursuit Regression (PPR): it corresponds to an
expansion of the type of eq. (1) in which all the coefficients $c_i$ are
set to 1, and

$$H_i(\mathbf{x}) = h_i(\mathbf{x} \cdot \mathbf{w}_i)$$

where the vectors $\mathbf{w}_i$ are normalized ($\|\mathbf{w}_i\| = 1$). The
functions $h_i$ are determined by means of some nonparametric estimation
technique, in the following iterative way:

(a) Assume that we already have $k - 1$ terms. Let

$$r_i = y_i - \sum_{j=1}^{k-1} h_j(\mathbf{x}_i \cdot \mathbf{w}_j)$$

be the residual of the approximation at the data point $(\mathbf{x}_i, y_i)$;

(b) Search for the next term. Calculate the following sum of the squared
residuals

$$\sum_{i} \left( r_i - h_k(\mathbf{x}_i \cdot \mathbf{w}_k) \right)^2$$

and find the direction $\mathbf{w}_k$ and the corresponding function $h_k$
that minimize it.

•  Flexible Fourier series: the functions $h_i$ are all equal to the
cosine (or sine) function, and therefore:

$$H_i(\mathbf{x}) = \cos(\mathbf{x} \cdot \mathbf{w}_i + \theta_i) . \qquad (2)$$

If we assume that the function $g$ underlying the data has a Fourier
transform $\tilde{g}(\mathbf{s}) = |\tilde{g}(\mathbf{s})| e^{i\phi(\mathbf{s})}$, then

$$g(\mathbf{x}) = \int_{R^d} d\mathbf{s}\, |\tilde{g}(\mathbf{s})| \cos(\mathbf{s} \cdot \mathbf{x} + \phi(\mathbf{s})) , \qquad (3)$$

and the expansion of eq. (1) with the choice (2) looks like a cubature
formula for the integral above, where the vectors $\mathbf{w}_i$ are
the points at which the integrand is sampled. In this case the
interpretation of the "weights" $\mathbf{w}_i$ is clear: they represent the
fundamental frequency content of the underlying function. It can
also be noticed that the problem of finding the optimal parameters
of the parametric function is equivalent to the problem of
finding the optimal sample points to use in the discretization of
the integral above.

•  Multilayer Perceptrons: the functions $h_i$ are all equal to a
sigmoidal function:

$$H_i(\mathbf{x}) = \sigma(\mathbf{x} \cdot \mathbf{w}_i + \theta_i) . \qquad (4)$$

The function $\sigma$ is usually defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

but other choices (such as the hyperbolic tangent) are equivalent, as
long as the sigmoidal shape is maintained.

While in the approximation by trigonometric functions the parameters
$\mathbf{w}_i$ have a simple interpretation in terms of frequencies,
in this case the meaning of the $\mathbf{w}_i$ is less clear. If the expansion
(4) is considered from the point of view of projection pursuit regression
(Friedman and Stuetzle, 1981; Huber, 1985), the $\mathbf{w}_i$ are
the "relevant directions" that are supposed to encode the most
information about the function.

•  Exponential sums: when the problem is one dimensional, a
well known nonlinear technique consists of approximation by exponential
sums (Braess, 1986; Gosselin, 1986; Hobby and Rice, 1967), and a number of
results are available in this case. In more than one dimension, the natural
extension corresponds to ridge function approximation with the choice:

$$H_i(\mathbf{x}) = e^{\mathbf{x} \cdot \mathbf{w}_i} . \qquad (5)$$

Notice that the bias terms disappear, since they are absorbed by
the coefficients.

We group the ridge function approximation schemes together because
they all have the same general form: a linear combination of nonlinear
functions of the scalar product of the input vector with the parameter
vector.

The main difference between these two classes of approximation schemes
(which, as we mentioned earlier, can both be derived from regularization in
terms of two different classes of stabilizers, reflecting different prior
assumptions) seems to be related to the use of the scalar product
$\mathbf{x} \cdot \mathbf{w}_i$ instead of the weighted Euclidean distance
$\|\mathbf{x} - \mathbf{t}_i\|_W$ as argument of the parametric functions $h$.

At first sight, these two broad classes of techniques do not seem to have
any relationship. The main point of this paper is to show that in some special
situations there is a close connection between these two classes of
approximation schemes, and in particular between Gaussian GRBF (i.e. HyperBF
networks with $W = I$) and MLPs with sigmoidal units in the hidden layer.


2  Normalized  Inputs:  Ridge  Functions  are 
Radial  Functions 

In the case in which all the input variables are normalized, that is, they lie
on the unit d-dimensional sphere, there is a simple connection between ridge
and radial functions. In fact the following identity

$$\|\mathbf{x} - \mathbf{t}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{t}\|^2 - 2\,\mathbf{x} \cdot \mathbf{t} \qquad (6)$$

for $\|\mathbf{x}\| = 1$ becomes:

$$\|\mathbf{x} - \mathbf{t}\|^2 = 1 + \|\mathbf{t}\|^2 - 2\,\mathbf{x} \cdot \mathbf{t} \qquad (7)$$

We can now use identity (7) to obtain a ridge function approximation scheme
from a Radial Basis Functions scheme and vice versa.
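As a quick numerical check of identity (7), the following Python sketch compares both sides for a random unit-norm input and an arbitrary center. The helper names (`dot`, `sq_dist`, `normalize`) and the chosen dimension are illustrative and do not come from the paper.

```python
import math
import random

def dot(a, b):
    """Scalar product of two vectors given as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def normalize(v):
    """Rescale v so that it lies on the unit sphere."""
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

random.seed(0)
d = 5                                                  # illustrative dimension
x = normalize([random.gauss(0, 1) for _ in range(d)])  # ||x|| = 1
t = [random.gauss(0, 1) for _ in range(d)]             # arbitrary center

lhs = sq_dist(x, t)                        # ||x - t||^2
rhs = 1.0 + dot(t, t) - 2.0 * dot(x, t)    # right-hand side of identity (7)
identity_gap = abs(lhs - rhs)
print(identity_gap)                        # only rounding error is expected
```

The center `t` is deliberately left unnormalized: identity (7) only requires the input to lie on the unit sphere.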

Footnote: The formulation of the learning problem in terms of regularization
is satisfying from a theoretical point of view, since it establishes
connections with a large body of results in the area of Bayes estimation and
in the theory of approximation of multivariate functions.

From radial basis functions to multilayer perceptrons.

Substituting identity (7) in the Radial Basis Functions expansion

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c_\alpha h(\|\mathbf{x} - \mathbf{t}_\alpha\|^2) \qquad (8)$$

we obtain

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c_\alpha H(\mathbf{x} \cdot \mathbf{t}_\alpha + \theta_\alpha) , \qquad (9)$$

where we have defined

$$H(x) = h(-2x) , \qquad \theta_\alpha = -\tfrac{1}{2}(1 + \|\mathbf{t}_\alpha\|^2) . \qquad (10)$$

We notice that eq. (9) is the expansion corresponding to a multilayer
perceptron network. The only difference is that while in the multilayer
perceptron network the bias parameter $\theta_\alpha$ is allowed to vary along
the real line, in this case it is constrained to lie in the interval
$(-\infty, -\tfrac{1}{2}]$. We therefore can conclude that, when inputs are
normalized, given the RBF network of eq. (8) with activation function $h$ it
is always possible to define a multilayer perceptron with the same number of
units and with activation function $H$ that computes the same function. The
synaptic weights connecting the input and the hidden layer are the centers of
the Radial Basis Functions expansion, and the bias parameters are uniquely
determined by the synaptic weights.
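The construction of eqs. (8) to (10) can be sketched in Python as follows. The Gaussian profile chosen for `h`, the network sizes and the random parameters are assumptions made only for illustration; any radial profile applied to the squared distance would do.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

def h(r2):
    """Radial profile applied to the squared distance (Gaussian, as an example)."""
    return math.exp(-r2)

def H(u):
    """Ridge activation induced by h through eq. (10): H(u) = h(-2u)."""
    return h(-2.0 * u)

def rbf_net(x, centers, coeffs):
    """Eq. (8): sum of c_a * h(||x - t_a||^2)."""
    return sum(c * h(sq_dist(x, t)) for c, t in zip(coeffs, centers))

def mlp_net(x, weights, biases, coeffs):
    """Eq. (9): sum of c_a * H(x . w_a + theta_a)."""
    return sum(c * H(dot(x, w) + th) for c, w, th in zip(coeffs, weights, biases))

random.seed(1)
d, n_units = 4, 3                                  # illustrative sizes
centers = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n_units)]
coeffs = [random.uniform(-1, 1) for _ in range(n_units)]
# Biases fixed by the centers, as in eq. (10); each one is at most -1/2.
biases = [-0.5 * (1.0 + dot(t, t)) for t in centers]

x = normalize([random.gauss(0, 1) for _ in range(d)])
gap = abs(rbf_net(x, centers, coeffs) - mlp_net(x, centers, biases, coeffs))
print(gap)
```

The multilayer perceptron reuses the centers as its weights, so it has no additional free parameters: its biases are a function of the weights.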

From  multilayer  perceptron  to  radial  basis  functions. 

In the previous case we have seen that Radial Basis Functions can
be simulated by a multilayer perceptron. This is not surprising, since a
Radial Basis Functions unit has one parameter less than a multilayer
perceptron unit. For the same reason we cannot expect to simulate
a multilayer perceptron unit with d inputs in terms of a Radial Basis
Functions unit with the same number of inputs. However, this may
be possible if we add to the Radial Basis Functions units a dummy
input, so that the coordinate of the center corresponding to the dummy
variable gives the missing parameter. Let us consider the multilayer
perceptron expansion:

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c_\alpha \sigma(\mathbf{x} \cdot \mathbf{w}_\alpha + \theta_\alpha) . \qquad (11)$$

Using identity (7) we obtain:

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c_\alpha \sigma\left(-\tfrac{1}{2}\|\mathbf{x} - \mathbf{w}_\alpha\|^2 + d_\alpha\right) , \qquad (12)$$

where we have defined

$$d_\alpha = \tfrac{1}{2}(1 + \|\mathbf{w}_\alpha\|^2) + \theta_\alpha .$$

We now rewrite the expansion as a Radial Basis Functions expansion
in d + 1 dimensions, in which one of the input variables is kept fixed,
and equal to some number (1, for simplicity). Introducing the d + 1
dimensional vectors

$$\tilde{\mathbf{x}} = (\mathbf{x}, 1) , \qquad \tilde{\mathbf{w}}_\alpha = (\mathbf{w}_\alpha, v_\alpha)$$

we can rewrite the expansion (12) as

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c_\alpha \sigma\left(-\tfrac{1}{2}\|\tilde{\mathbf{x}} - \tilde{\mathbf{w}}_\alpha\|^2 + \tfrac{1}{2}(1 - v_\alpha)^2 + d_\alpha\right) . \qquad (13)$$

Expansion (13) becomes a Radial Basis Functions expansion if one of
the two following conditions is satisfied:

1. There exists $v_\alpha$ such that

$$\tfrac{1}{2}(1 - v_\alpha)^2 = -d_\alpha ; \qquad (14)$$

2. There exist functions $g_1$ and $g_2$ such that

$$\sigma(x + y) = g_1(x)\, g_2(y) . \qquad (15)$$

In the first case, in fact, the multilayer perceptron expansion becomes

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c_\alpha h(\|\tilde{\mathbf{x}} - \tilde{\mathbf{w}}_\alpha\|^2) \qquad (16)$$

where we have defined the function $h(x) = \sigma(-\tfrac{1}{2}x)$. In the second
case the multilayer perceptron expansion becomes

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c'_\alpha\, g_1\left(-\tfrac{1}{2}\|\tilde{\mathbf{x}} - \tilde{\mathbf{w}}_\alpha\|^2\right) , \qquad (17)$$

where $c'_\alpha = c_\alpha\, g_2\left(d_\alpha + \tfrac{1}{2}(1 - v_\alpha)^2\right)$.

Remarks

- In case (1), a solution to eq. (14) does not exist unless the condition

$$\tfrac{1}{2}(1 + \|\mathbf{w}_\alpha\|^2) + \theta_\alpha \le 0$$

is satisfied. This condition on the weights $\mathbf{w}_\alpha$ and on the bias
parameter $\theta_\alpha$ defines a subclass of multilayer perceptrons that
can be simulated by a Radial Basis Functions network with the
same number of units and a dummy input, and therefore the same
number of parameters.

- In case (2), evaluating eq. (15) at $y = 0$ and at $x = 0$ we obtain,
respectively,

$$g_1(x) = \frac{\sigma(x)}{g_2(0)} , \qquad g_2(y) = \frac{\sigma(y)}{g_1(0)} ,$$

and after some algebra

$$\sigma(x + y) = \frac{\sigma(x)\,\sigma(y)}{\sigma(0)} . \qquad (18)$$

It is well known that this functional equation has only one class
of solutions, given by

$$\sigma(x) = c\,e^{x} , \qquad \forall c \in R .$$

Therefore radial units can simulate arbitrary multilayer perceptron
units only in the case in which the activation function of the
multilayer perceptron network is an exponential function. In this
case the corresponding Radial Basis Functions network is a Gaussian
network. In fact, setting $\sigma(x) = e^{x}$ in eq. (13) we obtain the
expansion

$$f(\mathbf{x}) = \sum_{\alpha=1}^{N} c'_\alpha\, e^{-\frac{1}{2}\|\tilde{\mathbf{x}} - \tilde{\mathbf{w}}_\alpha\|^2} , \qquad c'_\alpha = c_\alpha\, e^{d_\alpha + \frac{1}{2}(1 - v_\alpha)^2} .$$

Notice also that in this case there is no need of a dummy input,
since the bias term can be dropped by a redefinition of the
coefficients, due to the property (18) of the exponential function.
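The last remark can be illustrated with a short Python sketch: for normalized inputs, a ridge expansion with exponential activation coincides with a Gaussian expansion without dummy input, once each coefficient is multiplied by exp(θ + (1 + ||w||²)/2), which follows from identity (7). The variable names and random parameters are illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

random.seed(2)
d, n_units = 4, 3                                   # illustrative sizes
weights = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n_units)]
biases = [random.uniform(-1, 1) for _ in range(n_units)]
coeffs = [random.uniform(-1, 1) for _ in range(n_units)]
x = normalize([random.gauss(0, 1) for _ in range(d)])  # ||x|| = 1

# Ridge expansion with exponential activation: sum c_a * exp(x . w_a + theta_a).
ridge = sum(c * math.exp(dot(x, w) + th)
            for c, w, th in zip(coeffs, weights, biases))

# Gaussian expansion with redefined coefficients; no dummy input is needed
# because exp(u + v) = exp(u) * exp(v) lets the bias move into the coefficient.
gauss = sum(c * math.exp(th + 0.5 * (1.0 + dot(w, w))) * math.exp(-0.5 * sq_dist(x, w))
            for c, w, th in zip(coeffs, weights, biases))

exp_gap = abs(ridge - gauss)
print(exp_gap)
```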


3  Sigmoid  MLPs  and  Gaussian  GRBFs 


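In this section, we compare the sigmoid and Gaussian functions for normalized
input vectors $\|\mathbf{x}\| = 1$. Under these conditions the sigmoidal
function becomes a radial function that can approximate any Gaussian function
well enough within a certain range of the bias parameter $\theta$.

For normalized inputs, any ridge function is equivalent to an appropriate
radial function. Consider the sigmoid function given by

$$\sigma(\mathbf{w} \cdot \mathbf{x} + \theta) = \frac{1}{1 + \exp(-(\mathbf{w} \cdot \mathbf{x} + \theta))} , \qquad \mathbf{w} \in R^d ,\ \theta \in R .$$

The corresponding radial function, parametrized by $\lambda \in R$
($\lambda \neq 0$), is:

$$s\,\sigma(\mathbf{w} \cdot \mathbf{x} + \theta) = \frac{s}{1 + \exp(-(\mathbf{w} \cdot \mathbf{x} + \theta))} = \frac{s}{1 + C(\lambda)\exp(\lambda\|\mathbf{x} - \mathbf{t}\|^2)} \qquad (19)$$

with $s \in R$, $\mathbf{w} \in R^d$, $\theta \in R$ and $C(\lambda) > 0$,
where we have defined:

$$\mathbf{t} = \frac{\mathbf{w}}{2\lambda} , \qquad C(\lambda) = \exp\left(-\left(\theta + \lambda\left(1 + \frac{\|\mathbf{w}\|^2}{4\lambda^2}\right)\right)\right) . \qquad (20)$$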
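The equivalence stated by eqs. (19) and (20) can be checked numerically. The sketch below, with illustrative parameter values and s = 1, evaluates the radial form for several values of λ and compares it with the sigmoid on a normalized input.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def radial_form(x, w, theta, lam):
    """Right-hand side of eq. (19) with s = 1, using t and C(lambda) from eq. (20)."""
    t = [wi / (2.0 * lam) for wi in w]
    C = math.exp(-(theta + lam * (1.0 + dot(t, t))))
    return 1.0 / (1.0 + C * math.exp(lam * sq_dist(x, t)))

random.seed(3)
d = 3
w = [random.uniform(-2, 2) for _ in range(d)]       # illustrative weight vector
theta = -0.7                                        # illustrative bias
x = normalize([random.gauss(0, 1) for _ in range(d)])

ridge_value = sigmoid(dot(w, x) + theta)
# Every member of the one-parameter family should reproduce the sigmoid.
max_gap = max(abs(radial_form(x, w, theta, lam) - ridge_value)
              for lam in (0.25, 0.5, 1.0, 3.0))
print(max_gap)
```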


3.1  Sigmoids  units  can  approximate  Gaussian  units 

The existence of the free parameter $\lambda$ indicates that all the elements
of the one-parameter family given by (19) are equivalent to each other for
normalized inputs. To compare Gaussians and the radial functions associated
with sigmoids, we should measure the discrepancy between the two functions.
For the purpose of the comparison, we first take the function closest to a
Gaussian among all the elements of the one-parameter family defined above. It
turns out that the radial function (19) approximates the Gaussian function
well if $C(\lambda) \gg 1$ holds (see figures 2 and 3). According to this
observation, adopting $C(\lambda)$ as a measure of closeness to the Gaussian,
we consider the function whose parameter $\lambda^*$ corresponds to the
maximum of $C(\lambda)$:

$$C(\lambda^*) = \max_{\lambda} C(\lambda)$$

Solving the equation $dC/d\lambda = 0$, we get:

$$\lambda^* = \|\mathbf{w}\| / 2$$

Substituting $\lambda = \|\mathbf{w}\|/2$, we obtain the following radial
function $R^d \to R$, which has its center on the unit sphere:

$$s\,\sigma(\mathbf{w} \cdot \mathbf{x} + \theta) = \frac{s}{1 + C(\theta, \|\mathbf{w}\|)\exp\left(\frac{\|\mathbf{w}\|}{2}\left\|\mathbf{x} - \frac{\mathbf{w}}{\|\mathbf{w}\|}\right\|^2\right)} , \qquad C(\theta, \|\mathbf{w}\|) = \exp(-(\theta + \|\mathbf{w}\|)) \qquad (21)$$
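A small grid search illustrates the choice of λ*. The sketch below writes C(λ) from eq. (20) as exp(-(θ + λ + ||w||²/(4λ))), scans positive values of λ, and compares the maximizer with ||w||/2. The values of θ and ||w|| and the grid are arbitrary illustrative choices.

```python
import math

def C(lam, theta, norm_w):
    """C(lambda) of eq. (20), using ||t||^2 = ||w||^2 / (4 lambda^2)."""
    return math.exp(-(theta + lam + norm_w ** 2 / (4.0 * lam)))

theta, norm_w = -4.0, 3.0                      # illustrative bias and weight length
grid = [0.01 * k for k in range(1, 1001)]      # lambda from 0.01 to 10.0
best_lam = max(grid, key=lambda lam: C(lam, theta, norm_w))

lam_star = norm_w / 2.0                        # closed-form maximizer
c_max = C(lam_star, theta, norm_w)             # should equal exp(-(theta + ||w||))
print(best_lam, lam_star, c_max)
```

With these values θ is below -||w||, so the maximal C exceeds 1.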


Figure 2: The radial function $f(r) = 1/(1 + Ce^{r^2})$ associated with sigmoids
(see eq. (19)), for 3 different values of C: $C = 5, 10^{-1}, 10^{-3}$.

Figure 3: The Gaussian function $f(r) = e^{-r^2}$ and the radial functions
$f(r) = (1 + C)/(1 + Ce^{r^2})$ for 3 different values of C:
$C = 1, 10^{-1}, 10^{-3}$. The role of the scale factor $1 + C$ is to enforce
the value at the origin to be 1, to ease the comparison with the Gaussian.


As shown in figures (2) and (3), the radial function (21) is quite similar to
a Gaussian function if C satisfies $C \ge 1$ or, equivalently, if the bias
parameter $\theta$ satisfies

$$\theta \le -\|\mathbf{w}\| \qquad (22)$$

This implies that an MLP network can always simulate any Gaussian GRBF
network well enough if both networks consist of the same number of
units (of course the MLP network has one more free parameter per hidden
unit than a GRBF network), since $\theta$ can always be chosen to satisfy (22).

3.2  The  Gaussian  Radial  Basis  Function  on  the  sphere 

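Let us first consider a Gaussian function for normalized inputs. When the
d-dimensional input vectors $\mathbf{x} \in R^d$ are normalized as
$\|\mathbf{x}\| = 1$, for any Gaussian function the following relation holds:

$$c\exp(-\mu^2\|\mathbf{x} - \mathbf{t}\|^2) = c'(\lambda)\exp(-\mu'(\lambda)^2\|\mathbf{x} - \mathbf{t}'(\lambda)\|^2)$$

where $\lambda \in R$ ($\lambda > 0$) is an arbitrary parameter and

$$c'(\lambda) = c\exp\left(-\mu^2(1 - \lambda)\left(1 - \frac{\|\mathbf{t}\|^2}{\lambda}\right)\right) , \qquad \mu'(\lambda) = \mu\sqrt{\lambda} , \qquad \mathbf{t}'(\lambda) = \frac{\mathbf{t}}{\lambda} .$$

This shows that the representation $c\exp(-\mu^2\|\mathbf{x} - \mathbf{t}\|^2)$
is redundant for normalized inputs. For example, for any Gaussian function
$c\exp(-\mu^2\|\mathbf{x} - \mathbf{t}\|^2)$, the following two Gaussians, which
are obtained by setting $\lambda = \|\mathbf{t}\|$ and $\lambda = 1/\mu^2$
respectively, are equivalent to it for normalized inputs:

$$c\exp(-\mu^2\|\mathbf{x} - \mathbf{t}\|^2) = c'\exp\left(-\mu^2\|\mathbf{t}\|\left\|\mathbf{x} - \frac{\mathbf{t}}{\|\mathbf{t}\|}\right\|^2\right) \qquad \left(c' = c\exp(-\mu^2(1 - \|\mathbf{t}\|)^2)\right)$$

$$= c''\exp(-\|\mathbf{x} - \mu^2\mathbf{t}\|^2) \qquad \left(c'' = c\exp((1 - \mu^2)(1 - \mu^2\|\mathbf{t}\|^2))\right)$$

The above indicates that the total number of free parameters of a Gaussian
for normalized inputs is d + 1, and that we may use either Gaussians with
normalized centers,

$$c\exp(-\mu^2\|\mathbf{x} - \mathbf{t}\|^2) , \qquad \|\mathbf{t}\| = 1 ,$$

or those given by

$$c\exp(-\|\mathbf{x} - \mathbf{t}\|^2)$$

as basis functions for normalized inputs.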
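The redundancy of the Gaussian representation on the sphere can be verified with the following Python sketch. The function `reparametrize` implements the map from (c, μ, t) to (c'(λ), μ'(λ), t'(λ)) as derived from identity (7); the function names and numerical values are illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

def gaussian(x, c, mu, t):
    return c * math.exp(-mu ** 2 * sq_dist(x, t))

def reparametrize(c, mu, t, lam):
    """Equivalent Gaussian parameters for normalized inputs (lam > 0)."""
    tt = dot(t, t)
    c_new = c * math.exp(-mu ** 2 * (1.0 - lam) * (1.0 - tt / lam))
    mu_new = mu * math.sqrt(lam)
    t_new = [ti / lam for ti in t]
    return c_new, mu_new, t_new

random.seed(4)
d = 3
c, mu = 1.5, 0.8                                    # illustrative values
t = [random.uniform(-1, 1) for _ in range(d)]
x = normalize([random.gauss(0, 1) for _ in range(d)])
reference = gaussian(x, c, mu, t)

# lam = ||t|| gives a normalized center; lam = 1/mu^2 gives unit scale.
c1, mu1, t1 = reparametrize(c, mu, t, math.sqrt(dot(t, t)))
c2, mu2, t2 = reparametrize(c, mu, t, 1.0 / mu ** 2)
gap_normalized_center = abs(gaussian(x, c1, mu1, t1) - reference)
gap_unit_scale = abs(gaussian(x, c2, mu2, t2) - reference)
print(gap_normalized_center, gap_unit_scale)
```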

3.3  Can  Gaussian  units  approximate  sigmoid  units? 

From the above observation, we see that MLPs can be more flexible than
Radial Basis Functions for normalized inputs if the total number of units is
the same, since the total number of parameters is then larger for MLPs.
This is based upon a one-to-one comparison of the units of the two networks.
We expect that it may be possible to approximate the radial function (21)
using a set of m Gaussians with the same center:

$$\frac{1}{1 + Ce^{r^2}} \approx \sum_{j=1}^{m} a_j e^{-B_j r^2} \qquad (B_j > 0) \qquad (23)$$

Figure (4) shows the experimental results of this approximation for
C = 0.01 and C = 0.0001, using 3 Gaussians for each radial function
(sigmoid). The results suggest that a sigmoid can be approximated by a set of
Gaussians. The number of parameters of a sigmoid is d + 2. However,
d + 2m - 1 parameters are required for Gaussians to approximate a sigmoid
according to (23) (thus in the experiments, the total numbers of parameters
of a sigmoid and of a set of Gaussians are d + 2 and d + 5, respectively). In
this subsection, we consider the possibility of approximating a sigmoid by a
set of Gaussians with a similar number of parameters.

3.3.1  Approximation  by  Gaussians  with  Constant  Scales 

In (Poggio and Girosi, 1990a), we have extended our learning theory based
on regularization and proposed the use of radial basis functions at multiple
scales. To explore the possibility of approximating MLPs by Multiscale GRBF
with a similar number of parameters, we consider the following basis function
with constant scales

$$\phi_\alpha(\mathbf{x}) = \sum_{j=1}^{m} a_j \exp(-B_j\|\mathbf{x} - \mathbf{t}_\alpha\|^2) \qquad (\|\mathbf{t}_\alpha\| = 1) \qquad (24)$$

which approximates the radial function (21)

$$\frac{1}{1 + C\exp\left(\frac{\|\mathbf{w}\|}{2}\|\mathbf{x} - \mathbf{t}_\alpha\|^2\right)}$$

In this case, the above multiscale function $\phi_\alpha$ has d + m - 1
parameters. Since the center $\mathbf{t}_\alpha$ and the input $\mathbf{x}$
are normalized, the condition

$$0 \le \|\mathbf{x} - \mathbf{t}_\alpha\| \le 2$$

is satisfied. Therefore the coefficients $\{a_j\}$ are determined by solving
the minimization problem:

$$\int_0^2 \left(\frac{1}{1 + Ce^{\frac{\|\mathbf{w}\|}{2}r^2}} - \sum_{j=1}^{m} a_j e^{-B_j r^2}\right)^2 dr \to \min ,$$

and are given by the solution of the following linear system:

$$U\mathbf{a} = \mathbf{v} , \qquad U_{ij} = \int_0^2 e^{-(B_i + B_j)r^2}\,dr , \qquad v_i = \int_0^2 \frac{e^{-B_i r^2}}{1 + Ce^{\frac{\|\mathbf{w}\|}{2}r^2}}\,dr .$$

Figure 4: Approximation of the radial function $f(r) = 1/(1 + C\exp(r^2))$
(plot 1) by superposition of 3 Gaussians (plot 2): (a) C = 0.01,
(b) C = 0.0001.
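The least-squares fit described above can be sketched in Python as follows. This is a minimal illustration, not the original experiment: the scales in `B`, the value of `C`, the factor ||w||/2 and the trapezoidal quadrature are all assumptions, since the scales actually used are not reported in the text.

```python
import math

def target(r, C, half_norm_w):
    """Radial function associated with a sigmoid, as a function of r = ||x - t||."""
    return 1.0 / (1.0 + C * math.exp(half_norm_w * r * r))

def trapezoid(f, a, b, n=2000):
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + k * step) for k in range(1, n))
    return total * step

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

C, half_norm_w = 0.01, 1.0      # assumed example values (||w|| = 2)
B = [0.5, 1.0, 2.0]             # hypothetical fixed scales B_j
m = len(B)

# Normal equations U a = v of the least-squares problem on 0 <= r <= 2.
U = [[trapezoid(lambda r: math.exp(-(B[i] + B[j]) * r * r), 0.0, 2.0)
      for j in range(m)] for i in range(m)]
v = [trapezoid(lambda r: math.exp(-B[i] * r * r) * target(r, C, half_norm_w), 0.0, 2.0)
     for i in range(m)]
a = solve(U, v)

def fit(r):
    return sum(a[j] * math.exp(-B[j] * r * r) for j in range(m))

# Relative squared error of the fit over the same interval.
num = trapezoid(lambda r: (target(r, C, half_norm_w) - fit(r)) ** 2, 0.0, 2.0)
den = trapezoid(lambda r: target(r, C, half_norm_w) ** 2, 0.0, 2.0)
rel_error = num / den
residual = max(abs(sum(U[i][j] * a[j] for j in range(m)) - v[i]) for i in range(m))
print(a, rel_error)
```

Because the scales are kept fixed, the problem is linear in the coefficients and a single linear solve suffices; optimizing the scales as well would make it nonlinear.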


In Figs. 5, 6 and 7 we show experimental results of the approximation
error for m = 3, 4, 5 (the number of parameters of each scheme is thus
d + 2, d + 3, d + 4, respectively). The approximation errors are evaluated as:

$$\epsilon = \frac{\int_0^2 \left(\frac{1}{1 + Ce^{\frac{\|\mathbf{w}\|}{2}r^2}} - \sum_{j=1}^{m} a_j e^{-B_j r^2}\right)^2 dr}{\int_0^2 \frac{dr}{\left(1 + Ce^{\frac{\|\mathbf{w}\|}{2}r^2}\right)^2}}$$

In the experiments, the scales $\{B_j\}$ are given by the results of experiment
(23). The results show that, if the length of the weight vector is not too
large, a sigmoid for normalized inputs can be well approximated by a
superposition of Gaussians with the same number of parameters.

Figure 5: Approximation of radial function by Multiscale Gaussian: 3 scales.

Figure 6: Approximation of radial function by Multiscale Gaussian: 4 scales.

Figure 7: Approximation of radial function by Multiscale Gaussian: 5 scales.

3.3.2  Approximation by Gaussians with Fixed Ratio of Scales

In the previous subsection, we have shown that a superposition of Gaussians
can approximate sigmoid units very well for a certain range of parameters
with the same (or a similar) number of parameters. In this subsection we
consider, as another possibility for approximating any sigmoid with
Multiscale GRBF, the following basis function:

$$\phi_\alpha(\mathbf{x}) = \sum_{j=1}^{3} a_j \exp(-\mu_\alpha^2 B_j\|\mathbf{x} - \mathbf{t}_\alpha\|^2) \qquad (\|\mathbf{t}_\alpha\| = 1) \qquad (25)$$

where the ratios of the scales $B_j$ (j = 1, 2, 3) are constants. The total
number of parameters of this basis function is d + 3. In Figure (8), we show
some results of approximation experiments. In the experiments, $\{B_j\}$ are
given from the results of experiment (23) as before and $\{a_j\}$ are
determined so that

$$\int_0^\infty \left(\frac{1}{1 + Ce^{r^2}} - \sum_{j=1}^{3} a_j e^{-B_j r^2}\right)^2 dr \to \min .$$

The optimal $\{a_j\}$ are obtained by solving the linear equations:

$$U\mathbf{a} = \mathbf{v} , \qquad U_{ij} = \frac{1}{2}\sqrt{\frac{\pi}{B_i + B_j}} , \qquad v_i = \int_0^\infty \frac{e^{-B_i r^2}}{1 + Ce^{r^2}}\,dr .$$
Table (1) shows the approximation errors given by:

$$\epsilon = \frac{\int_0^\infty \left(\frac{1}{1 + Ce^{r^2}} - \sum_{j=1}^{3} a_j e^{-B_j r^2}\right)^2 dr}{\int_0^\infty \frac{dr}{(1 + Ce^{r^2})^2}}$$

C     1         0.1       0.01      0.001     0.0001    0.00001
ε     0.00165   0.00243   0.0171    0.0204    0.0162    0.0307

Table 1: The relative approximation error ε that is obtained approximating
a sigmoid with a multiscale GRBF in which the ratios of the variances of the
Gaussians are kept fixed.

Figure 8: Approximation of the radial function $f(r) = 1/(1 + C\exp(r^2))$
(plot 1) by superposition of 3 Gaussians with fixed ratio of scales (plot 2):
(a) C = 0.1, (b) C = 0.001, (c) C = 0.00001.


4  Experiments

In the previous section, we have shown that sigmoid MLPs can always
simulate any Gaussian GRBF network when both networks consist of the
same number of units (sigmoids and Gaussians). The converse is also true
if the MLP's bias parameters are restricted to a certain range given by (22).
To investigate whether this constraint is always satisfied in practice,
numerical experiments were carried out. Even if the original inputs are not
normalized, it is always possible to get normalized inputs by adding one more
dimension as follows:


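$$\mathbf{x} = (x_1, \ldots, x_d) , \qquad |x_i| \le 1$$

$$\mathbf{x} \to \mathbf{x}' = (x'_1, \ldots, x'_d, x'_{d+1}) , \qquad x'_i = \frac{x_i}{\sqrt{d}} \ (i = 1, \ldots, d) , \qquad x'_{d+1} = \sqrt{1 - \sum_{i=1}^{d} x_i'^2}$$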
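The normalization step can be sketched in Python as follows. The function name `lift_to_sphere` is hypothetical, and the rescaling of each coordinate by 1/sqrt(d) is an assumption of this sketch, chosen so that the extra coordinate is always real when every |x_i| is at most 1.

```python
import math

def lift_to_sphere(x):
    """Map x in [-1, 1]^d to a unit vector in d + 1 dimensions.

    Assumption: coordinates are rescaled by 1/sqrt(d) so that the squared
    norm of the first d coordinates never exceeds 1.
    """
    d = len(x)
    scaled = [xi / math.sqrt(d) for xi in x]
    # The extra coordinate absorbs whatever norm is missing.
    last = math.sqrt(max(0.0, 1.0 - sum(s * s for s in scaled)))
    return scaled + [last]

point = lift_to_sphere([0.6])                 # one-dimensional input on a half circle
lifted = lift_to_sphere([0.3, -0.8, 1.0])     # three-dimensional input
norm = math.sqrt(sum(c * c for c in lifted))
print(point, norm)
```

For d = 1 the rescaling is trivial and the map reduces to x -> (x, sqrt(1 - x^2)), which places the input on a half circle.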

In the experiments, we tried to approximate one-dimensional functions which
are mapped onto a (hemi)circle using the technique shown above:

$$F(x, y) = F(x) , \qquad |x| \le 1 , \qquad y = \sqrt{1 - x^2}$$

The functions used in our experiments are as follows (Fig. (9)):

•  Fi  =  X 

•  Fj  =  e”®*  cos(|7rx) 

•  Fa  =  cos(|7rx) 

•  $F_4 = \sin(3\pi x) / (3\pi x)$

•  $F_5 = e^{-x}\sin(5x)$


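They were approximated with two-dimensional MLP (sigmoids), constrained
MLP, GRBF (Gaussians), and Multiscale Gaussians, given below, where
$\mathbf{x} = (x, y)$.

•  MLP (Sigmoids)

$$f(x, y) = \sum_{\alpha=1}^{K} c_\alpha \sigma(\mathbf{w}_\alpha \cdot \mathbf{x} + \theta_\alpha) , \qquad \sigma(u) = \frac{1}{1 + e^{-u}}$$

•  Constrained MLP (Sigmoid)

$$f(x, y) = \sum_{\alpha=1}^{K} c_\alpha \sigma(\mathbf{w}_\alpha \cdot \mathbf{x} + \theta_\alpha) , \qquad \theta_\alpha = -\left(\tfrac{1}{4}\|\mathbf{w}_\alpha\|^2 + 1\right)$$

$$C(\theta_\alpha, \|\mathbf{w}_\alpha\|) = \exp\left(\tfrac{1}{4}(\|\mathbf{w}_\alpha\| - 2)^2\right) \ge 1$$

•  GRBF (Gaussians)

$$f(x, y) = \sum_{\alpha=1}^{K} c_\alpha \exp(-\|\mathbf{x} - \mathbf{t}_\alpha\|^2)$$

•  Multiscale GRBF

$$f(x, y) = \sum_{\alpha=1}^{K} \sum_{\beta=1}^{m} c_{\alpha\beta} \exp(-\mu_\beta^2\|\mathbf{x} - \mathbf{t}_\alpha\|^2)$$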


Approximation performances were compared according to the normalized $L_2$ and
$L_\infty$ errors measured on both the training set and the evaluation set as
follows:

$$L_2^{train} = \frac{\sum_{i=1}^{N} (F(\mathbf{x}_i) - f(\mathbf{x}_i))^2}{\sum_{i=1}^{N} F(\mathbf{x}_i)^2} , \qquad L_\infty^{train} = \frac{\max_{i=1}^{N} |F(\mathbf{x}_i) - f(\mathbf{x}_i)|}{\max_{i=1}^{N} |F(\mathbf{x}_i)|}$$

$$L_2^{eval} = \frac{\sum_{p=1}^{M} (F(\mathbf{x}_p) - f(\mathbf{x}_p))^2}{\sum_{p=1}^{M} F(\mathbf{x}_p)^2} , \qquad L_\infty^{eval} = \frac{\max_{p=1}^{M} |F(\mathbf{x}_p) - f(\mathbf{x}_p)|}{\max_{p=1}^{M} |F(\mathbf{x}_p)|}$$

where $\{\mathbf{x}_i\}_{i=1}^{N}$ and $\{\mathbf{x}_p\}_{p=1}^{M}$ are the
training set and the evaluation set, respectively. In our experiments, these
sets are randomly chosen and the numbers of points in the training set and
the evaluation set are N = 20 and M = 100, respectively.

In Table 2 we report the training and test errors, both in the $L_2$ and
$L_\infty$ norm, that we obtained applying the techniques described above to
the set of test functions $F_1, \ldots, F_5$. In Table 3 we report the values of
$C(\theta_\alpha, \|\mathbf{w}_\alpha\|)$ and $\|\mathbf{w}_\alpha\|$ for the
hidden units of the MLP network. These results show that condition (22) is not
always satisfied. To see how the target functions were approximated by the MLP
(sigmoids), we show in Figs. (10), (11), (12), (13) and (14) the solutions
obtained in our experiments.
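The normalized error measures used for the comparison can be written as two small Python functions. The sample points and the constant offset of the approximant below are made-up illustrative values, not data from the experiments.

```python
def l2_error(target_vals, approx_vals):
    """Normalized L2 error: sum of squared errors over sum of squared targets."""
    num = sum((F - f) ** 2 for F, f in zip(target_vals, approx_vals))
    den = sum(F ** 2 for F in target_vals)
    return num / den

def linf_error(target_vals, approx_vals):
    """Normalized L-infinity error: largest absolute error over largest |target|."""
    num = max(abs(F - f) for F, f in zip(target_vals, approx_vals))
    den = max(abs(F) for F in target_vals)
    return num / den

xs = [-1.0, 0.0, 1.0]                        # illustrative sample points
target_vals = list(xs)                       # test function F_1(x) = x
approx_vals = [F + 0.1 for F in target_vals] # hypothetical approximant, off by 0.1

err_l2 = l2_error(target_vals, approx_vals)      # 3 * 0.01 / 2
err_linf = linf_error(target_vals, approx_vals)  # 0.1 / 1
print(err_l2, err_linf)
```

Both measures are normalized by the target, so they can be compared across test functions of different magnitude; they are undefined for a target that vanishes on all sample points.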

5  Conclusions 

This  paper  explores  the  relationships  between  MLPs  and  GRBF.  Its  main 
point  is  that  under  the  condition  of  normalized  inputs  the  two  representa¬ 
tions  are  very  similar,  though  not  identical.  This  is  somewhat  surprising, 
especially  since  MLPs  anf  RBF  can  be  regarded  cis  representative  of  two  dif¬ 
ferent  ways  of  approximating  functions  based,  respectively  on  scalar  products 
and  euclidean  distances. 

For  normalized  d-dimensional  input  vectors  sigmoidal  MLPs  can  always 
approximate  essentially  arbitrarily  well  a  given  GRBF  (which  has  M  less 
parameters,  where  M  is  the  number  of  hidden  units).  The  converse  is  true 
only  for  a  certain  range  of  the  bias  parameter  in  the  sigmoidal  unit.  We  have 
verified  experimentally  that  in  MLPs  trained  with  backpropagation  the  bias 
parameters  do  not  always  respect  this  condition.  Therefore  MLPs  gener¬ 
ated  by  backpropagation  cannot  in  general  be  approximated  arbitrarily  well 
by  GRBFs  with  the  same  number  of  hidden  units  (and  d  less  parameters). 
GRBFs  that  have  3  Gaussians  per  center  and  therefore  the  same  number  of 
parameters  as  a  MLP  network  with  the  same  number  of  hidden  units  yield 
a  reasonable  approximation  of  a  given  sigmoid  MLP  but,  again,  not  for  all 
parameters  values.  Within  the  framework  of  this  paper,  there  seems  to  be 
no  simple  answer  to  the  question  of  what  is  the  relation  between  MLPs  and 
radial  HyperBF  with  full  or  diagonal  W  (a  HyperBF  network  with  diag¬ 
onal  W  has  the  same  number  of  psirameters  as  a  MLP  network  with  the 
same  number  of  hidden  units).  All  this  implies  that  MLPs  network  are  -  for 
normalized  inputs  -  more  powerful  than  GRBFs  (and  of  course  than  RBF) 
networks  with  the  same  number  of  hidden  units.  Notice  that  the  property  of 
being  more  powerful  is  not  necessarily  an  advantage  here,  since  the  number 
of  parameters  is  larger  (parameters  to  be  learned  are  1  per  hidden  unit  in  the 
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Table  2:  Training  and  test  errors,  both  in  the  L2  and  Z-oo  norm,  for  the  set 
of  test  functions  Fi, . . . ,  F5. 
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Plot  1 


Plot  2 


Plot  3  -  PI 


Figure  10:  Approximation  of  the  function  Fj  by  a  MLP  with  2  hidden  units: 
(plot  1)  Fi  =  X,  (plot2)  basis  1,  (plot  3)  basis  2,  (plot  4)  result.  Notice  the 
complete  overlapping  between  the  original  function  (plot  1)  and  the  MLP 
result  (plot  4). 
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Plot  1 
Plot  5 


Plot  2 


Plot  3 


Plot  4 


Figure  11:  Approximation  of  the  function  F2  by  a  MLP  with  3  hidden  units: 
(plot  1)  F2  =  cos(|7rx) ,  (plot  2)  basis  1,  (plot  3)  basis  2,  (plot  4)  basis  3, 
(plot  5)  result.  Notice  the  complete  overlapping  between  the  original  function 
(plot  1)  and  the  MLP  result  (plot  5). 

Figure 12: Approximation of the function F3 by a MLP with 3 hidden units:
(plot 1) F3 = cos([…]πx), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3,
(plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap
between the original function (plot 1) and the MLP result (plot 6).
Figure 13: Approximation of the function F4 by a MLP with 3 hidden units:
(plot 1) F4 = […], (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot
5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap between
the original function (plot 1) and the MLP result (plot 6).
Figure 14: Approximation of the function F5 by a MLP with 3 hidden units:
(plot 1) F5 = e^([…]) sin(5x), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3,
(plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap
between the original function (plot 1) and the MLP result (plot 6).
      (C(θ1,||w1||), ||w1||)   (C(θ2,||w2||), ||w2||)   (C(θ3,||w3||), ||w3||)

F1    (0.0509, 2.14)           […]
F2    (7.67, 6.30)             (0.829, 0.492)           […]
F3    (11.4, 10.5)             (0.00371, 0.504)         […]
F4    (7.48, 8.01)             (1.60, 8.63)             (0.0271, 7.70)
F5    (1.25, 2.37)             (0.182, 6.63)            […]

Table 3: The values of C(θα, ||wα||) and ||wα|| corresponding to the hidden
units of the MLP network (α = 1, 2, 3).


case  of  RBFs,  d  +  1  in  the  case  of  GRBF  and  d  +  2  in  the  case  of  MLPs)  and 
actual  performance  should  be  considered  (see  Maruyama  et  al.,  1991,  for  an 
experimental  comparison  between  different  approximation  schemes). 
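
The parameter counts quoted above can be illustrated with a minimal sketch. The function name and the breakdown given in the comments (which quantities make up each count) are assumptions made for illustration; the text itself states only the totals 1, d + 1 and d + 2.

```python
def params_per_hidden_unit(scheme, d):
    """Number of learned parameters per hidden unit for input dimension d.

    The totals follow the text; the breakdown in the comments is an
    assumed reading, not something the text spells out.
    """
    counts = {
        "RBF": 1,       # assumed: output coefficient only, centers held fixed
        "GRBF": d + 1,  # assumed: movable center (d values) + coefficient
        "MLP": d + 2,   # assumed: weight vector (d values) + bias + coefficient
    }
    return counts[scheme]


def total_params(scheme, d, hidden_units):
    """Total number of learned parameters for a network with `hidden_units` units."""
    return hidden_units * params_per_hidden_unit(scheme, d)
```

For example, with d = 2 inputs and 3 hidden units, this gives 3 parameters for the RBF case, 9 for GRBF and 12 for the MLP.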

How critical is the condition of normalized inputs for the above arguments?
Mathematically there is no simple way of extending them to inputs
that are not normalized. We have already mentioned the very recent results
of Poggio and Girosi, which show that HyperBF and a certain class of ridge
function approximation schemes are regularization networks corresponding
to two different classes of stabilizers (and therefore different priors in the
associated Bayesian interpretation). In the context of the arguments of this
paper, it is interesting to remark that normalized inputs have been used in
several well-known applications, with good reported performance. NETtalk
is one example. In NETtalk the input layer consists of 7 groups of units
and each group contains 29 units (i.e. the number of units in the input layer
is 203). Each group corresponds to a letter presented to NETtalk, and each
unit represents a character of the alphabet (including period, blank, etc.).
Therefore, when input letters are presented, only one unit in each group has
the value “1” and the others have the value “0”, as follows:


x = ( 0, ..., 0, 1, 0, ..., 0,   ...,   0, ..., 0, 1, 0, ..., 0 )
      \_______ group 1 _______/          \_______ group 7 _______/


Clearly, ||x||^2 = const = 7 always.
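
The constant norm of this input coding can be checked with a short sketch. The 29-symbol alphabet used here is hypothetical: the text mentions only the letters plus "period, blank, etc.", so the third extra symbol (a comma) and the names `ALPHABET`, `WINDOW` and `encode_window` are assumptions made for illustration.

```python
import string

# Hypothetical 29-symbol alphabet: 26 letters, blank, period, and a comma.
# The comma is an assumption; the text lists only "period, blank, etc.".
ALPHABET = string.ascii_lowercase + " .,"
WINDOW = 7  # number of letter groups in the input layer


def encode_window(text):
    """Encode a 7-character window as 7 concatenated one-of-29 groups."""
    if len(text) != WINDOW:
        raise ValueError("expected exactly %d characters" % WINDOW)
    x = []
    for ch in text:
        group = [0] * len(ALPHABET)
        group[ALPHABET.index(ch)] = 1  # exactly one active unit per group
        x.extend(group)
    return x


def squared_norm(x):
    """Squared Euclidean norm of a vector given as a list of numbers."""
    return sum(v * v for v in x)


x = encode_window("a cat. ")
print(len(x), squared_norm(x))
```

Whatever window is presented, each group contributes exactly one unit with value 1, so the squared norm equals the number of groups.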

For  normalized  inputs  it  seems  therefore  that  there  is  a  strict  relation, 
almost  an  equivalence,  between  the  vector  of  the  weights  w  in  a  MLP  network 
and the vector of the centers in the associated GRBF network. This fact has
several potentially interesting consequences. The first is that it may be
possible to initialize the weights of a MLP efficiently. In particular, if we have
a sufficient number of units (as many as there are data points), we can design
an optimal initial network, using the properties of RBF and the sigmoid-Gaussian
quasi-equivalence. Several fast learning algorithms have been proposed for
radial (basis) functions. They include Moody and Darken's method (Moody
and Darken, 1989), based on the k-means algorithm, and Kohonen's LVQ. Our
arguments imply that it should be possible to exploit these algorithms also
for determining an initial structure of a MLP network.
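
One way such an initialization could look is sketched below; it is an illustration under stated assumptions, not the procedure of this paper. Centers are found by a plain k-means loop on unit-norm inputs, and each center t is turned into a sigmoid unit with w = βt. For ||x|| = 1 we have w·x = β(1 + ||t||^2 - ||x - t||^2)/2, so the unit depends on x only through the distance ||x - t||. The gain `beta`, the radius `r0` at which the unit outputs 0.5, and all function names are hypothetical choices.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (Lloyd iterations) on a list of vectors."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def init_hidden_unit(center, beta=4.0, r0=0.5):
    """Return (w, theta) so that, for unit-norm x,
    w.x + theta = beta * (r0**2 - ||x - center||**2) / 2.
    beta and r0 are illustrative values, not taken from the text."""
    w = [beta * c for c in center]
    theta = -0.5 * beta * (1.0 + sum(c * c for c in center) - r0 * r0)
    return w, theta


def hidden_output(w, theta, x):
    """Sigmoid hidden unit evaluated at input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return 1.0 / (1.0 + math.exp(-z))
```

With this choice a unit responds most strongly to inputs near its center and its output falls to 0.5 at distance `r0`, which is the sense in which the weight vector plays the role of a center. The output coefficients and any further training by gradient descent are left out of the sketch.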

The second point has to do with the common interpretation of the weights.
T. Sejnowski (Sejnowski and Rosenberg, 1987) writes: “In NETtalk, the
intermediate code was semidistributed - around 15% of the hidden units
were used to represent each letter-to-sound correspondence. The vowels and
the consonants were fairly well segregated, arguing for local coding at a
gross population level (something seen in the brain) but distributed coding
at the level of single units (also observed in the brain).” Our results seem to
suggest that the “intermediate code” may often be a collection of appropriate
“templates”, in particular some of the examples.